Creating simulated data to test generalized Cohen's kappa

2/11/2021 - Here I will document the process of creating simulated data to test our inter-rater reliability algorithm.

2/25/2021 - Added testing of generalized Cohen's kappa with variation of parameters: number of users, number of codes and number of texts

3/4/2021 - Added new testing of original Cohen's kappa to compare behaviour


First, I coded an algorithm to create simulated codings given some parameters:

  1. Number of codes
  2. Number of coders
  3. Number of texts

Given that we are using non-mutually exclusive categories, each coder can apply one or more codes to each text. To choose how many codes each coder will apply, we use probabilities. The probability that a coder applies 1, 2 or 3 codes to a text is set to [0.7, 0.2, 0.1]. This means that there is a 70% chance that a coder applies only one code, a 20% chance of two codes and a 10% chance of three codes.

Future work: Extract these probabilities from existing (real) data.

In addition, the code(s) each coder will apply also depend on probabilities. At first we will set equal probabilities for every code and coder. For example, if we have 5 codes, there is a 20% probability of each code being applied to any text.
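Putting the two sampling steps together, here is a minimal sketch of the simulation; the function name `simulate_codings` and the exact mechanics are my assumptions about what simulation.py does, not its actual code:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def simulate_codings(n_codes=5, n_coders=3, n_texts=1000,
                     count_probs=(0.7, 0.2, 0.1)):
    """Return a list of (text, coder, code) codings."""
    codings = []
    for text in range(n_texts):
        for coder in range(n_coders):
            # How many codes this coder applies to this text (1, 2 or 3).
            n_applied = rng.choice([1, 2, 3], p=count_probs)
            # Which codes: equal probability for every code, no repeats.
            applied = rng.choice(n_codes, size=n_applied, replace=False)
            codings.extend((text, coder, int(code)) for code in applied)
    return codings

codings = simulate_codings()
print(len(codings))  # expected around n_texts * n_coders * 1.4 = 4200
```

With the expected 1.4 codes per coder per text, 3 coders and 1000 texts, the total should land near 4200, which is consistent with the 4205 codings reported below.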

(maybe) Future work: Try with different probabilities for each code and coder.


The parameters were set to the following values:

  1. Number of codes = 5
  2. Number of coders = 3
  3. Number of texts = 1000

We will use different parameters in future iterations. Using the algorithm simulation.py we obtained 4205 different codings. A coding is one code applied to one text by one coder. These simulated codings were saved to the file simulated_data.csv.

The observed agreement of this simulated data is 0.5988.

Agreement is calculated for each text and code. If more than half the coders apply a code, or none of them does, then they agree on that code for that text. Given that we are using 5 codes, each text can have from 0 to 5 agreements. We will now count agreements for each text.
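The agreement rule above can be sketched like this (a hypothetical sketch, not necessarily the algorithm's actual code): for each text/code pair, count the coders who applied the code, and call it an agreement when that count is either zero or a strict majority.

```python
import numpy as np

def agreements_per_text(codings, n_codes, n_coders, n_texts):
    """Count agreements (0..n_codes) for each text.

    codings: iterable of (text, coder, code) triples. A text/code pair
    is an agreement when either no coder applied the code, or more
    than half of the coders did.
    """
    # applied[text, code] = number of coders who applied the code.
    applied = np.zeros((n_texts, n_codes), dtype=int)
    for text, _coder, code in codings:
        applied[text, code] += 1
    agree = (applied == 0) | (applied > n_coders / 2)
    return agree.sum(axis=1)

# Toy check: 2 texts, 2 codes, 3 coders.
toy = [(0, 0, 0), (0, 1, 0), (0, 2, 1),   # text 0: code 0 by 2/3, code 1 by 1/3
       (1, 0, 1), (1, 1, 1), (1, 2, 1)]   # text 1: code 0 by 0/3, code 1 by 3/3
counts = agreements_per_text(toy, n_codes=2, n_coders=3, n_texts=2)
print(counts)  # → [1 2]
```

The observed agreement is then the total number of agreements divided by the number of text/code pairs (here 3/4 = 0.75).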

Now that we have the agreement count for each text, let's visualize how it is distributed.

We can observe that the data resembles a normal distribution. At this point I realized that I could try different distributions in order to obtain different observed agreements. To do this I created two algorithms: distribution_generator.py and distribution_selector.py.

The first algorithm generates multiple distributions by varying the skew and the scale of the distribution. The skew goes from -50 to 50 in steps of 0.1, while the scale changes from 1 to 100.

Let's visualize how skew changes a distribution.

We can observe that the distribution moves and slightly alters its shape. Now let's visualize how scale changes a distribution.

We can observe that it changes the height of the distribution.


Distribution generation to obtain different agreements

Using both of these variations, the algorithm distribution_generator.py creates 18,000 different distributions with agreements ranging from 0.01 to 0.99. Then distribution_selector.py selects 101 distributions, one for each agreement value in steps of 0.01, adding the distributions for agreement 0 and 1. For each agreement, the algorithm selects the distribution that is closest to the normal distribution.
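A minimal sketch of the skew/scale sweep, assuming the family of distributions is skew-normal (my assumption; distribution_generator.py may use a different family). To stay self-contained it samples via the standard two-normal construction instead of a stats library:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skew_normal(skew, scale, loc=0.0, size=10_000):
    """Draw samples from a skew-normal distribution.

    Standard construction: with U0, U1 independent N(0, 1) and
    delta = skew / sqrt(1 + skew**2), the variable
    delta*|U0| + sqrt(1 - delta**2)*U1 is skew-normal with shape `skew`.
    """
    delta = skew / np.sqrt(1.0 + skew**2)
    u0 = rng.standard_normal(size)
    u1 = rng.standard_normal(size)
    z = delta * np.abs(u0) + np.sqrt(1.0 - delta**2) * u1
    return loc + scale * z

# Sweep a few (skew, scale) pairs: positive skew shifts mass right,
# negative skew shifts it left, and a larger scale widens the curve.
for skew, scale in [(-5, 1), (0, 1), (5, 1), (0, 10)]:
    samples = sample_skew_normal(skew, scale)
    print(skew, scale, round(samples.mean(), 2), round(samples.std(), 2))
```

The full generator would sweep skew from -50 to 50 in steps of 0.1 and scale from 1 to 100, keeping the agreement each resulting distribution produces.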

Let's visualize some of these distributions.

In the following visualization we can see all the distributions at the same time.


Generalized Cohen's kappa testing

Using these 101 distributions we use simulation.py again to generate one dataset for each distribution. With each one of these datasets we run Generalized Cohen's kappa and obtain observed agreement, chance agreement and kappa score.

With the following visualization we are going to observe how observed and chance agreement behave across the different agreement levels.

Now let's look at how generalized Cohen's kappa behaves given this simulated data.

Why do we have kappa = -1.2?

Let's consider the Cohen's kappa equation: $k = \frac{P_o - P_e}{1 - P_e}$, with $P_o \in [0, 1]$ and $P_e \in [0, 0.99]$, then:
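A negative kappa simply means that chance agreement exceeds observed agreement. For a fixed $P_e$, the score is minimized at $P_o = 0$:

$$\min_{P_o} k = \frac{0 - P_e}{1 - P_e} = -\frac{P_e}{1 - P_e}$$

so $k = -1.2$ corresponds to $P_e = \frac{1.2}{2.2} \approx 0.545$ (assuming $P_o \approx 0$ for that dataset, which is my reading rather than something stated here). In general $k$ is only bounded below by $-P_e/(1 - P_e)$, which reaches $-99$ at the maximum $P_e = 0.99$.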

We can observe that there is a correlation between both scores which is what we expected to see. As future work we will try with different parameters and random seeds.


Testing with different parameters

To test the behaviour of the algorithm we varied the following parameters:

  1. Number of users
  2. Number of codes
  3. Number of texts

Variation of number of users

We varied the number of users from 2 to 5, maintaining the number of texts at 1000 and the number of codes at 5.

Let's first look at the variation in chance agreement across these four values, since we already know that the observed agreement is calculated the same way.

Now let's look at the generalized Cohen's kappa score.

Variation of number of codes

We varied the number of codes using the values 3, 5 and 7, maintaining the number of texts as 1000 and number of users as 3.

Let's first look at the variation in chance agreement across these three values, since we already know that the observed agreement is calculated the same way.

Now let's look at the generalized Cohen's kappa score.

Variation of number of total texts

We varied the number of texts using the values 1000, 3000 and 5000, maintaining the number of users as 3 and number of codes as 5.

Let's first look at the variation in chance agreement.

Now let's look at the generalized Cohen's kappa score.


What is the behaviour of original Cohen's kappa with simulated data?

I created new simulated datasets that fit the restrictions of Cohen's kappa: only two users, five mutually exclusive codes and 1000 texts.
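For reference, the standard two-rater Cohen's kappa for mutually exclusive codes can be computed like this (a textbook implementation, not necessarily the one used in these experiments):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Two-rater Cohen's kappa for mutually exclusive categories."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of texts where both raters chose the same code.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over codes of the product of marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters, codes 0-4, mostly agreeing.
a = [0, 1, 2, 3, 4, 0, 1, 2]
b = [0, 1, 2, 3, 4, 0, 1, 3]
print(round(cohens_kappa(a, b), 3))  # → 0.843
```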

Let's look at the results.

It's really interesting to see that chance agreement is almost constant, just as with generalized Cohen's kappa. Let's look at the Cohen's kappa score now.

We can observe that it also starts from a negative value, though not as low as the generalized version. Let's compare them side by side.

Let's compare the kappa scores now.

I tried changing the probabilities with which each code is applied, to see if that is what affects the chance agreement.

It seems that there is a change. What if we randomly select the probabilities for each agreement? I expect to see a different chance agreement for each initial agreement, rather than the constants we saw before.
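One simple way to draw a random probability vector over the codes is a Dirichlet sample; this is a hypothetical choice on my part, since the selection method for this experiment isn't fixed yet:

```python
import numpy as np

rng = np.random.default_rng(7)

def random_code_probs(n_codes=5):
    """Draw a random probability vector over the codes (non-negative, sums to 1)."""
    return rng.dirichlet(np.ones(n_codes))

probs = random_code_probs()
print(probs.round(3), probs.sum())

# A coder would then pick a code with these weights instead of uniform ones:
code = int(rng.choice(5, p=probs))
```

With a fresh draw per dataset, each initial agreement level would come with its own code-probability vector, which is exactly the setup needed to check whether chance agreement stops being constant.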